17.2 DNA Methylation Profiling
Epigenetic information is lost during standard Sanger sequencing or NGS because
the polymerases involved in PCR treat methylated cytosine as ordinary cytosine.
Although the overall proportion of methylated DNA can be determined chemically,
in order to properly understand the regulatory rôle of methylation, it is necessary
to determine the methylation status of each base in sequence (bearing in mind that,
in vertebrates, methylation occurs predominantly at the cytosine of CpG
dinucleotides). The methylation status of a nucleotide can be determined
by pyrosequencing (Sect. 17.1.3), but that technique is limited to relatively short
nucleotide sequences. A more recent method relies on treating DNA with bisulfite
(under acidic conditions cytosine is converted to uracil, whereas methylated cytosine
is not; after amplification the uracil is read as thymine) and comparing the sequence
with the untreated one. 12 Even newer is the technique called MethylCap-seq 13: the
DNA is sonicated, fragmenting it into pieces around 300 base pairs long, which are
then exposed to MBD-GST (a methyl-CpG-binding domain fused to glutathione
S-transferase) immobilized on magnetic beads. The beads capture methylated
fragments at low concentrations of NaCl; a gradient of increasing salt concentration
then elutes the DNA fragments from the beads. Epigenetic profiling is of growing
importance to medicine. 14
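The comparison at the heart of the bisulfite method can be illustrated with a minimal sketch. It assumes that the treated and untreated reads come from the same strand, are already aligned and of equal length, and that uracil appears as T after amplification; the function name `call_methylation` and the example sequences are hypothetical, chosen only for illustration. Real pipelines must also handle alignment, incomplete conversion, and the opposite strand.

```python
def call_methylation(untreated: str, treated: str) -> dict[int, bool]:
    """Return {position: is_methylated} for each cytosine in `untreated`.

    Assumption: both strings are the same strand, pre-aligned, equal length.
    """
    if len(untreated) != len(treated):
        raise ValueError("sequences must be aligned and of equal length")
    calls = {}
    for i, (ref, obs) in enumerate(zip(untreated.upper(), treated.upper())):
        if ref != "C":
            continue  # only cytosines carry the methylation signal
        if obs == "C":
            calls[i] = True   # protected from conversion: methylated
        elif obs == "T":
            calls[i] = False  # converted C -> U, read as T: unmethylated
        # any other base (sequencing error, variant) is left uncalled
    return calls


# Hypothetical example: cytosines at positions 1, 4, 5 and 9.
untreated = "ACGTCCGATCGA"
treated = "ACGTTTGATTGA"
print(call_methylation(untreated, treated))
```

In this toy example only the cytosine at position 1, which lies in a CpG dinucleotide, survives the treatment and is therefore called as methylated.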
17.3 Gene Identification
Gene identification (or “gene finding”) is the process of identifying regions in the
genome that are likely to correspond to genes, using a combination of computational
algorithms, statistical analysis, and other bioinformatics tools. Other features, such
as regulatory elements and splice sites, may assist the finding process. The ultimate
goal of gene identification (or “gene prediction”) is automatic annotation: to identify
all biochemically active portions of the genome by algorithmically processing the
sequence and to predict the reactions and reaction products of those portions coding
for proteins. At present we are still some way from this goal. Success will not only
allow one to discover the functions of natural genes but should also enable the
biochemistry of new, artificial sequences to be predicted and, ultimately, the
sequence necessary to accomplish a given function to be prescribed.
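As an illustration of the simplest computational ingredient of gene finding, the following sketch scans the three forward reading frames of a sequence for open reading frames (ORFs), i.e., stretches beginning with ATG and ending at the first in-frame stop codon. It is a naive sketch under stated assumptions (standard genetic code, forward strand only, no introns), not a description of any particular gene finder; the function name `find_orfs` and the parameter `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for ORFs on the forward strand.

    `start` is the index of the A of ATG; `end` is exclusive and includes
    the stop codon. `min_codons` counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical example sequence containing one short ORF in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Statistical gene finders go well beyond this, scoring candidate regions by codon usage and by the presence of nearby signals such as the regulatory elements and splice sites mentioned above.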
In eukaryotes, the complicated exon–intron structure of the genome makes it
particularly difficult to predict the course of the key operations of transcription,
splicing, and translation from a sequence alone (even without the possibility that
essential instructions encoded in acylation of histones, etc., are transmitted
epigenetically from generation to generation).
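One way to appreciate the difficulty is to enumerate candidate introns using only the canonical rule that most introns begin with GT and end with AG. The sketch below does this naively; the function name `candidate_introns` and the length bounds are hypothetical choices for illustration, and real splice-site predictors use much richer sequence models.

```python
import re


def candidate_introns(seq: str, min_len: int = 20,
                      max_len: int = 10000) -> list[tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, such that seq[start:end]
    begins with GT, ends with AG, and has a length within the bounds."""
    seq = seq.upper()
    # Lookahead patterns find every occurrence, including overlapping ones.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    return [(d, a) for d in donors for a in acceptors
            if min_len <= a - d <= max_len]


# Hypothetical toy sequence; min_len is lowered so the short example matches.
print(candidate_introns("AAGTAAAAAGCC", min_len=4))
```

Because GT and AG each occur by chance about once every 16 bases in a random sequence, the number of candidate pairs grows rapidly with sequence length, which is why the dinucleotide signals alone cannot identify the true introns.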
Challenges remain in identifying the exons, introns, promoters, and so on in each
stretch of DNA, such that the exons could be grouped into genes and the promoters
12 Bibikova et al. (2006); Bibikova and Fan (2010).
13 Brinkman et al. (2010); for other methods, see Zuo et al. (2009).
14 See, e.g., Heyn and Esteller (2012).